The video visualization comparison across different data cleaning strategies reveals distinct outcomes Videos generated by strategy-1 fail to capture local motion and tend to be stationary. In contrast, videos produced by strategy-2 exhibit excessive motion, compromising semantic coherence. Meanwhile, videos generated by strategy-3 effectively model intricate hand movements, striking a balance between motion strength and semantic fidelity.